Project Home

All packages used for analysis and figures on this page:

# General data wrangling
library(tidyverse)
library(DT)
library(readxl)

# Visualization
library(plotly)
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(gridExtra)
library(ggplotify)
library(knitr)
library(kableExtra)

# Modeling
library(glmnet)
library(caret)
library(ranger)

# Ensemble building
library(caretEnsemble)

# Saving RData objects
library(cgwtools)

This page focuses on developing both individual models and model ensembles to predict annual change in CDR-Sum of Boxes based on principal components (PCs) derived from regional tau-PET uptake per year, age at baseline, and sex. These data were prepared in Data Preparation and form the PCA-transformed dataset. The same models are applied to non-PCA-transformed data in Original Modeling to compare the results here with those from the original ROIs.

Data preparation

Load data

First, load in the data prepared in the previous data preparation phase:

# NOTE: RData files are not on GitHub as they contain real ADNI information.
# They may be recreated by following along with the code included in this GitHub Pages website.
load("../RData/Prepared_Data.RData")

Note: I am not including the RData in my GitHub repo as it contains real ADNI data. The data contained in the .RData files can be recreated by following along beginning with the Data Understanding page.

Data split for train + test

As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, so that the models are trained and evaluated on independent data. I will use the sample function with a fixed seed (127) to partition the data into 75% training and 25% testing.

# Set seed for reproducible random sampling of the train/test split
set.seed(127)
train.index <- sample(nrow(post.pca), nrow(post.pca)*0.75, replace=F)

# Remove unnecessary identifier info from datasets for modeling
pca.df <- post.pca %>% ungroup() %>% select(-RID, -interval_num)

# Pre-processing will be applied in model training with caret

# Subset training + test data for PCA-transformed data
pca.train <- pca.df[train.index, ]
pca.test <- pca.df[-train.index, ]

Individual model training and tuning

I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.

The first step is to create a caretList of the four regression models I will use:

  • Elastic net regression (glmnet)
  • k-nearest neighbors regression (knn)
  • Neural network regression (nnet)
  • Random forest regression (ranger)

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.

# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)

I’m using RMSE as the metric by which model parameters are tuned. All input features will be standardized using the preProcess center + scale option; the centering and scaling parameters are estimated on the training data and automatically applied to any new data at prediction time.

# Set seed for consistency
set.seed(127)

# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), decay = seq(0, 1, by=0.2)))

# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)

# random forest: try 15 different numbers of features considered at each split and grow 500 trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")

# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet", tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))

# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ ., 
                                                      data=pca.train, 
                                                      trControl=ensemble.control, 
                                                      metric="RMSE", 
                                                      preProcess=c("center", "scale"),
                                                      tuneList=list(my.neuralnet, my.knn, my.randomforest, my.elasticnet))))

The final chosen parameters for each model can be viewed:

Elastic net final model

# Print elastic net model summary
ensemble.models$glmnet

# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
  mutate(alpha=as.character(alpha)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=alpha)) +
  theme_minimal() +
  ggtitle("Elastic Net Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "lambda", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "lambda", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE      Rsquared    MAE      
##   0.0    0.00032  1.120429  0.09655352  0.7065869
##   0.0    0.00800  1.120429  0.09655352  0.7065869
##   0.0    0.20000  1.111204  0.09264358  0.6992882
##   0.0    1.00000  1.092248  0.08877960  0.6888541
##   0.1    0.00032  1.121643  0.09689871  0.7074884
##   0.1    0.00800  1.121216  0.09676384  0.7070698
##   0.1    0.20000  1.103896  0.10168777  0.6900957
##   0.1    1.00000  1.078150  0.05497418  0.6799398
##   0.6    0.00032  1.121655  0.09694159  0.7074408
##   0.6    0.00800  1.119002  0.09673422  0.7044619
##   0.6    0.20000  1.080718  0.05926795  0.6763468
##   0.6    1.00000  1.072871         NaN  0.6855987
##   1.0    0.00032  1.121677  0.09694846  0.7074455
##   1.0    0.00800  1.117345  0.09722338  0.7024775
##   1.0    0.20000  1.075379  0.06960295  0.6799288
##   1.0    1.00000  1.072871         NaN  0.6855987
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.6 and lambda = 1.

Based on RMSE, lambda=1 was chosen, the largest of the four assessed penalty values. Additionally, alpha=0.6 was chosen, meaning the model blends ridge and lasso regression, leaning slightly toward lasso (which uses L1 regularization).
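As a reminder of what these two hyperparameters control, the elastic net penalty that glmnet adds to the squared-error loss can be written out directly. This is an illustrative base-R sketch of the penalty term only (the coefficient vector here is made up), not part of the modeling pipeline:

```r
# Elastic net penalty: lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|))
# alpha = 0 -> pure ridge (L2); alpha = 1 -> pure lasso (L1)
enet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -0.25, 0)               # hypothetical coefficients
enet_penalty(beta, alpha = 0.6, lambda = 1)  # blend chosen by CV above
```

With alpha=0.6, the L1 term dominates, which is why some coefficients can be shrunk exactly to zero.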

 

kNN final model

# Print kNN model summary
ensemble.models$knn

# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One point per value of k, colored by metric
  ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("kNN Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.knn.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "k", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "k", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared    MAE      
##    5  1.095632  0.08289345  0.6389863
##    7  1.099697  0.05080179  0.6478129
##    9  1.095010  0.03837144  0.6484233
##   11  1.084744  0.03278807  0.6288731
##   13  1.081373  0.03914047  0.6244824
##   15  1.073794  0.05760648  0.6188573
##   17  1.070625  0.06005288  0.6145598
##   19  1.077580  0.04757898  0.6178363
##   21  1.078646  0.04796042  0.6193779
##   23  1.080235  0.04242322  0.6200468
##   25  1.079717  0.04149783  0.6173809
##   27  1.072803  0.05326838  0.6150709
##   29  1.071514  0.05248265  0.6134723
##   31  1.071258  0.05519706  0.6133139
##   33  1.074527  0.04573226  0.6114621
##   35  1.072335  0.05369971  0.6078421
##   37  1.072503  0.05093173  0.6069605
##   39  1.072482  0.05254741  0.6071913
##   41  1.072883  0.05328870  0.6059066
##   43  1.073248  0.04967071  0.6059558
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

k=17 was chosen to minimize the RMSE. This value is also a local minimum for the MAE, though the MAE continues to decrease slightly at larger values of k.
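For intuition, kNN regression simply averages the outcome of the k nearest training points. A minimal base-R sketch using Euclidean distance on already-standardized features (the data here are made up; caret's knn method handles ties and scaling internally):

```r
# Predict one new observation by averaging the outcomes of its k nearest neighbors
knn_predict <- function(train_x, train_y, new_x, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # Euclidean distance to each training row
  mean(train_y[order(d)[seq_len(k)]])             # average outcome of the k closest
}

set.seed(127)
x <- matrix(rnorm(40), ncol = 2)        # 20 hypothetical training points
y <- x[, 1] + rnorm(20, sd = 0.1)       # outcome driven by the first feature
knn_predict(x, y, new_x = c(0, 0), k = 5)
```

Larger k smooths the prediction surface, which is why the MAE keeps creeping downward at higher k even as the RMSE bottoms out at k=17.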

 

Neural network final model

# Print neural network model summary
ensemble.models$nnet

# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
  mutate(decay=as.character(decay)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of decay
  ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=decay)) +
  theme_minimal() +
  ggtitle("Neural Network Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.nnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Neurons in Hidden Layer", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Neurons in Hidden Layer", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared    MAE      
##    1    0.0    1.245644  0.09270336  0.7320901
##    1    0.2    1.192384  0.05973576  0.7215185
##    1    0.4    1.167875  0.06849000  0.6952365
##    1    0.6    1.166977  0.09277111  0.7182222
##    1    0.8    1.138191  0.07626984  0.7016022
##    1    1.0    1.147937  0.11282020  0.7182524
##    3    0.0    1.279886  0.08310500  0.7504749
##    3    0.2    1.245645  0.07589623  0.7871604
##    3    0.4    1.199659  0.08438305  0.7179262
##    3    0.6    1.153764  0.08378698  0.7050315
##    3    0.8    1.151742  0.05853814  0.6968032
##    3    1.0    1.139317  0.04608047  0.6920970
##    5    0.0    2.852068  0.10901075  1.2135027
##    5    0.2    1.186223  0.07133170  0.7566989
##    5    0.4    1.299139  0.05493359  0.8220907
##    5    0.6    1.240741  0.03175004  0.7999623
##    5    0.8    1.150480  0.03837154  0.7027839
##    5    1.0    1.152905  0.04999714  0.7128881
##   10    0.0    1.617319  0.04562712  1.1305324
##   10    0.2    1.312021  0.05878781  0.8568002
##   10    0.4    1.283676  0.06840569  0.8449359
##   10    0.6    1.172149  0.05627968  0.7334233
##   10    0.8    1.163288  0.05143240  0.7458729
##   10    1.0    1.186060  0.04664239  0.7251889
##   15    0.0    1.828227  0.10215527  1.2971160
##   15    0.2    1.440809  0.05869640  0.9635911
##   15    0.4    1.286711  0.03298476  0.8384455
##   15    0.6    1.241962  0.06854765  0.7906647
##   15    0.8    1.184070  0.06865887  0.7409692
##   15    1.0    1.159711  0.02835005  0.7211840
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.8.

The optimal neural network based on RMSE includes one neuron in the hidden layer, which receives input from all seven input nodes (five PCs, baseline age, and sex) and feeds the final prediction node. The decay weight was chosen as 0.8. Here’s a graphical representation of this caret-trained neural network:

par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
        circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)

Age at baseline, PC1, PC5, and sex (male) all exert negative weights onto the hidden layer, while PC2, PC3, and PC4 exert positive weights.
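With size=1, the fitted network reduces to a single logistic hidden unit feeding a linear output node (the linout=T setting above). A sketch of that forward pass, with made-up placeholder weights rather than the fitted values:

```r
# Forward pass of a size-1 nnet: logistic hidden unit, linear output (linout = TRUE)
nnet_forward <- function(x, w_in, b_in, w_out, b_out) {
  h <- 1 / (1 + exp(-(sum(w_in * x) + b_in)))  # logistic activation of the single hidden unit
  b_out + w_out * h                            # linear output node, unconstrained range
}

# Hypothetical standardized inputs: 5 PCs, baseline age, sex
x <- c(0.3, -1.2, 0.5, 0.1, -0.4, 1.1, 1)
nnet_forward(x, w_in = rep(0.1, 7), b_in = 0, w_out = 1.5, b_out = 0.2)
```

The sign of each input-to-hidden weight (as visualized by plotnet) determines whether that predictor pushes the hidden activation, and hence the prediction, up or down.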

Random forest final model

# Print random forest model summary
ensemble.models$ranger

# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per split rule
  ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=splitrule)) +
  theme_minimal() +
  ggtitle("Random Forest Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.rf.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Predictors in Decision Node", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Predictors in Decision Node", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared    MAE      
##   2     variance    1.087704  0.09159020  0.6645584
##   2     extratrees  1.076396  0.04057227  0.6589970
##   3     variance    1.097318  0.09194553  0.6706066
##   3     extratrees  1.075592  0.05217097  0.6561724
##   4     variance    1.105181  0.09240744  0.6759372
##   4     extratrees  1.079706  0.05795037  0.6574317
##   5     variance    1.109070  0.09926728  0.6742750
##   5     extratrees  1.084472  0.05483920  0.6599709
##   6     variance    1.106586  0.09518613  0.6773890
##   6     extratrees  1.089994  0.05894553  0.6637487
##   7     variance    1.110481  0.09341193  0.6773223
##   7     extratrees  1.094493  0.05715666  0.6636269
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 3, splitrule = extratrees and min.node.size = 5.

An mtry of 3 with an extratrees split rule yielded the lowest RMSE, meaning three predictor terms are considered at each decision node. This combination also yielded the lowest MAE value.

Individual model performance

These four models can be resampled and aggregated using the resamples function:

# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
## 
## Call:
## summary.resamples(object = ensemble.results)
## 
## Models: nnet, knn, ranger, glmnet 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## nnet   0.4206481 0.5442455 0.7859245 0.7016022 0.8214511 0.9074631    0
## knn    0.3460927 0.4526976 0.6951723 0.6145598 0.7222040 0.8729141    0
## ranger 0.4110419 0.5181849 0.6957156 0.6561724 0.7555823 0.9238009    0
## glmnet 0.4474888 0.5692098 0.7479911 0.6855987 0.7901979 0.9026879    0
## 
## RMSE 
##             Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   0.5935476 0.7122655 1.244294 1.138191 1.393087 1.877673    0
## knn    0.4889936 0.6738322 1.140509 1.070625 1.387874 1.874883    0
## ranger 0.6181761 0.7294735 1.101456 1.075592 1.301085 1.863276    0
## glmnet 0.5007390 0.7217365 1.101043 1.072871 1.375093 1.818658    0
## 
## Rsquared 
##               Min.    1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## nnet   0.001466325 0.03338898 0.06436858 0.07626984 0.08118771 0.2525644    0
## knn    0.004153727 0.01810022 0.03311722 0.06005288 0.07510692 0.2381806    0
## ranger 0.003690445 0.01414694 0.03767737 0.05217097 0.08695230 0.1530365    0
## glmnet          NA         NA         NA        NaN         NA        NA   10

The NA values for the glmnet model \(R^2\) indicate that the model outputs the same prediction for every case, likely because only the intercept term remains after coefficient shrinkage. The kNN model has the lowest mean MAE and RMSE of all the models.
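This behavior follows from how caret computes \(R^2\) by default: the squared Pearson correlation between predictions and observations, which is undefined when the prediction vector has zero variance. A quick base-R check (with arbitrary simulated observations):

```r
# A constant prediction vector, like an intercept-only model produces
preds <- rep(0.3539, 10)
obs   <- rnorm(10)

# Correlation is undefined when sd(preds) = 0, so the squared correlation is NA
suppressWarnings(cor(preds, obs)^2)
```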

It’s also useful to look at the regression correlation between these component models:

cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
## 
## Root-Mean Square Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.9889090 0.9867333 0.9820596
## knn    0.9889090 1.0000000 0.9918527 0.9957175
## ranger 0.9867333 0.9918527 1.0000000 0.9873565
## glmnet 0.9820596 0.9957175 0.9873565 1.0000000
## 
## 
## Mean Absolute Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.9664251 0.9553354 0.9660434
## knn    0.9664251 1.0000000 0.9838072 0.9753176
## ranger 0.9553354 0.9838072 1.0000000 0.9632123
## glmnet 0.9660434 0.9753176 0.9632123 1.0000000

The models are highly correlated in terms of error (all R>0.95). This is not ideal: ensembles gain the most when component models make uncorrelated errors that can offset one another. The error can be visualized with confidence intervals using the dotplot function:

# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE")) 
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)

The models all show very similar RMSE confidence intervals. The kNN model shows a lower MAE than the other three models, while the neural network shows the highest. Despite the strong error correlation between the models, I’ll move forward to see if ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.

Model ensemble training and tuning

Generalized linear ensemble

Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:

set.seed(127)
# Set trainControl --> repeated 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)

# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet 
## They were weighted: 
## 3.3024 -1.0743 -0.307 1.0037 -8.0853
## The resulting RMSE is: 1.0619
## The fit for each individual model on the RMSE is: 
##  method     RMSE    RMSESD
##    nnet 1.138191 0.4385607
##     knn 1.070625 0.4549577
##  ranger 1.075592 0.4061609
##  glmnet 1.072871 0.4254785

All the models show similar RMSE values, but kNN offers the lowest by a slim margin. The caretEnsemble package provides an autoplot method for in-depth diagnostic plots of ensemble models:

autoplot(stacked.ensemble.glm)

The top left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The red dashed line shows the RMSE of the ensemble model, which is slightly lower than that of any of the individual models. The middle-left plot shows a large negative weight associated with the elastic net model, smaller weights associated with kNN, neural network, and random forest, and a larger positive weight for the intercept.

Stacked ensembles

Stacked ensembles can also be constructed using caretStack, which trains a user-specified meta-model on the constituent models' predictions. I’ll try one using a random forest meta-model and one using an elastic net meta-model.
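Conceptually, stacking fits a second-stage model whose inputs are the base models' predictions. A minimal base-R sketch with lm as the meta-learner and simulated predictions (illustrative only; caretStack handles the cross-validated prediction bookkeeping itself):

```r
set.seed(127)
y <- rnorm(100)
# Pretend these are out-of-fold predictions from two base models
pred_a <- y + rnorm(100, sd = 0.5)
pred_b <- y + rnorm(100, sd = 0.8)

# Meta-model: learn how to weight the base models' predictions
meta <- lm(y ~ pred_a + pred_b)
coef(meta)  # intercept plus one weight per base model, as in the glm ensemble above

# Ensemble prediction = meta-model applied to the base models' new predictions
predict(meta, newdata = data.frame(pred_a = 0.2, pred_b = 0.1))
```

Swapping lm for a random forest or glmnet meta-learner gives the two stacks built below.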

set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)

The summary statistics for each model can be displayed:

cat("\nStacked ensemble, generalized linear model:\n") 
stacked.ensemble.glm 
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
## 
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.061885  0.08684754  0.6777372
## 
## 
## 
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Random Forest 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE      
##   2     1.144390  0.06698688  0.6945539
##   3     1.162099  0.06489556  0.7019328
##   4     1.178654  0.06058592  0.7048839
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## 
## 
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## glmnet 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 220, 222, 221, 221, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared   MAE      
##   0.10   0.0003874836  1.035647  0.1052673  0.6773436
##   0.10   0.0038748363  1.035568  0.1052222  0.6772353
##   0.10   0.0387483634  1.034216  0.1040302  0.6752533
##   0.55   0.0003874836  1.035676  0.1052599  0.6773358
##   0.55   0.0038748363  1.035463  0.1050010  0.6769625
##   0.55   0.0387483634  1.033049  0.1031535  0.6730375
##   1.00   0.0003874836  1.035682  0.1052608  0.6773399
##   1.00   0.0038748363  1.035382  0.1047341  0.6766830
##   1.00   0.0387483634  1.032724  0.1026793  0.6725872
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03874836.

The elastic net-stacked ensemble shows the lowest cross-validated RMSE and MAE of the three ensembles, though the difference is slim and may not translate to out-of-sample data. These three ensemble models (glm-, rf-, and glmnet-combined) will be used to predict annual change in CDR-Sum of Boxes in the same held-out test dataset. Note: since the models were trained with center and scale preprocessing specified, the test data does not need to be manually pre-processed.

Model predictions

Training data predictions

# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict.train(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
ensemble.rf.train <- predict(stacked.ensemble.rf)
real.train <- pca.train$CDRSB

# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
                                ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train, 
                                ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()

datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))

The elastic net individual model clearly predicts the same value (0.3539) for every training case, confirming it will not be of much use for predicting change in CDR-Sum of Boxes.

Test data predictions

# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- pca.test$CDRSB  
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=pca.test)
knn.test <- predict.train(ensemble.models$knn, newdata=pca.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=pca.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=pca.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=pca.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=pca.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=pca.test)

# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
                               ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()

datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))

Predicted vs. real CDR-SoB comparison

I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data, and to compare these results with those from the individual component models to see if constructing the ensemble confers a predictive advantage. First, I will visualize how the predicted values stack up to the actual values for annual change in CDR-Sum of Boxes, in both the training and the test data:

# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df  %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900) 

To quantify the association between real and model-predicted CDR-SoB values, I will use the \(R^2\) and RMSE values through the R2 and RMSE functions from caret:
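For reference, both metrics reduce to one-liners. A hand-rolled sketch of what these caret helpers compute (RMSE as the root mean squared difference; R2, by default, as the squared Pearson correlation), not their exact internals:

```r
# Hand-rolled versions of the two evaluation metrics
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
r2   <- function(pred, obs) cor(pred, obs)^2

# Hypothetical predicted vs. observed annual CDR-SoB changes
obs  <- c(0.0, 0.5, 1.0, 2.0)
pred <- c(0.2, 0.4, 1.3, 1.6)
rmse(pred, obs)
r2(pred, obs)
```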

# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
                         ensemble.glm=RMSE(ensemble.glm.train, real.train),
                         ensemble.rf=RMSE(ensemble.rf.train, real.train),
                         elastic.net=RMSE(glmnet.train, real.train),
                         knn=RMSE(knn.train, real.train),
                         neural.net=RMSE(nnet.train, real.train),
                         random.forest=RMSE(rf.train, real.train),
                         Metric="Train_RMSE")

# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
                        ensemble.glm=RMSE(ensemble.glm.test, real.test),
                        ensemble.rf=RMSE(ensemble.rf.test, real.test),
                        elastic.net=RMSE(glmnet.test, real.test),
                        knn=RMSE(knn.test, real.test),
                        neural.net=RMSE(nnet.test, real.test),
                        random.forest=RMSE(rf.test, real.test),
                        Metric="Test_RMSE")

# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
                       ensemble.glm=R2(ensemble.glm.train, real.train),
                       ensemble.rf=R2(ensemble.rf.train, real.train),
                       elastic.net=R2(glmnet.train, real.train),
                       knn=R2(knn.train, real.train),
                       neural.net=R2(nnet.train, real.train),
                       random.forest=R2(rf.train, real.train),
                       Metric="Train_R2")

# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
                      ensemble.glm=R2(ensemble.glm.test, real.test),
                      ensemble.rf=R2(ensemble.rf.test, real.test),
                      elastic.net=R2(glmnet.test, real.test),
                      knn=R2(knn.test, real.test),
                      neural.net=R2(nnet.test, real.test),
                      random.forest=R2(rf.test, real.test),
                      Metric="Test_R2")

# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  mutate(Metric = str_replace(Metric, "_", " ")) %>%
  pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
  mutate_if(is.numeric, function(x) round(x,4)) %>%
  kable(., booktabs=T) %>% kable_styling(full_width=F)
Model            Train RMSE  Test RMSE  Train R2  Test R2
ensemble.glmnet      0.8693     0.8442    0.7380   0.1521
ensemble.glm         0.7375     0.8312    0.7863   0.1600
ensemble.rf          0.8609     0.8731    0.5243   0.0665
elastic.net          1.1418     0.8978        NA       NA
knn                  1.0950     0.9041    0.1255   0.0411
neural.net           1.1139     0.9090    0.0485   0.0003
random.forest        0.6511     0.8259    0.8860   0.1752

The random forest model shows the lowest RMSE between predicted vs. real CDR-Sum of Boxes for both the training and test data. It also has the highest \(R^2\) value between predicted and real CDR-Sum of Boxes rate of change. However, its training \(R^2\) of 0.8860 is substantially larger than its test \(R^2\) of 0.1752, suggesting substantial overfitting.

In the training dataset, the random forest predictions closely track the actual values, with the glm- and elastic net-stacked ensemble models showing nearly comparable correlations between real vs. predicted values. However, this relationship is not evident in the testing data. Aside from the elastic net individual model (whose \(R^2\) is NA, likely because its predictions are near-constant, leaving the correlation undefined), the individual and ensemble models all tend to over-estimate CDR-Sum of Boxes change for test cases with zero actual change. Beyond zero, however, the models tend to underestimate CDR-Sum of Boxes change, particularly for values of ~1.5 and above.
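One way to quantify this over-/under-estimation pattern is to compute the mean signed error within ranges of actual change. A minimal sketch, assuming test.df contains the real.CDR column alongside one prediction column per model (as in the plotting code above):

```r
# Mean signed error (prediction - actual) by range of actual CDR-SoB change;
# positive values indicate over-estimation, negative values under-estimation
test.df %>%
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Range = cut(real.CDR, breaks=c(-Inf, 0, 1.5, Inf),
                     labels=c("<= 0", "0 to 1.5", "> 1.5"))) %>%
  group_by(Model, Range) %>%
  summarise(Mean.Bias = mean(Prediction - real.CDR), .groups="drop")
```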

# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  # Reshape to facet on metric -- i.e. RMSE or R2
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  separate(Metric, into=c("Data", "Metric"), sep="_")

p.ensemble.r2.rmse <- overall.ensemble.results %>%
  mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
  mutate(Data=factor(Data, levels=c("Train", "Test")),
         Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  facet_wrap(Metric~., scales="free", nrow=1) +
  theme(strip.text=element_text(size=12, face="bold"),
        axis.title=element_blank())

# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>% 
  layout(yaxis = list(title = "R2", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "Data Subset", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "Data Subset", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

There are two clusters of model performance here: the random forest model and the elastic net- and glm-stacked ensembles show high \(R^2\) in the training data, and all decrease to very similar values (0.15-0.17) in the test dataset. All three also show the lowest RMSE. By contrast, the neural network, kNN, and random forest-stacked ensemble models show low \(R^2\) and high RMSE. Interestingly, their error is higher in the training dataset than in the test dataset.
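The train-to-test \(R^2\) drop offers a simple single-number index of overfitting per model. A short sketch using the overall.ensemble.results dataframe compiled above (columns Data, Metric, Model, Value):

```r
# Train-to-test R-squared drop per model as a quick overfitting index
overall.ensemble.results %>%
  filter(Metric == "R2") %>%
  pivot_wider(names_from="Data", values_from="Value") %>%
  mutate(R2.Drop = Train - Test) %>%
  arrange(desc(R2.Drop))
```

Larger values of R2.Drop correspond to models that fit the training data far better than the unseen test data, such as the random forest here.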

Save data for model evaluation

Lastly, I will append the PCA dataset and the ensemble models developed herein to an .RData file for use in Model Evaluation. First, since all of the objects share the same names across the four modeling pages, I will rename the model objects and datasets to be PCA-specific:

# Rename relevant objects to have PCA-specific names
models.for.ensemble.PCA <- ensemble.models
model.metrics.PCA <- overall.ensemble.results
stacked.ensemble.glm.PCA <- stacked.ensemble.glm
stacked.ensemble.glmnet.PCA <- stacked.ensemble.glmnet
stacked.ensemble.rf.PCA <- stacked.ensemble.rf

# If RData file already exists, use resave from cgwtools to append these objects
if (file.exists("../RData/Model_Results.RData")) {
  cgwtools::resave(models.for.ensemble.PCA, model.metrics.PCA,
                   stacked.ensemble.glm.PCA, stacked.ensemble.glmnet.PCA,
                   stacked.ensemble.rf.PCA, pca.train, pca.test, 
                   file="../RData/Model_Results.RData")
} else {
  # If the RData file does not yet exist, create it here
  save(models.for.ensemble.PCA, model.metrics.PCA,
       stacked.ensemble.glm.PCA, stacked.ensemble.glmnet.PCA,
       stacked.ensemble.rf.PCA, pca.train, pca.test,
       file="../RData/Model_Results.RData")
}

Previous page: Modeling (Original) Next page: Model Evaluation